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Direct simulation of biomolecular dynamics in thermal equilibrium is challenging due to the metastable nature 
of conformation dynamics and the computational cost of molecular dynamics. Biased or enhanced sampling 
methods may improve the convergence of expectation values of equilibrium probabilities and expectation 
values of stationary quantities significantly. Unfortunately the convergence of dynamic observables such 
as correlation functions or timescales of conformational transitions relies on direct equilibrium simulations. 
Markov state models are well suited to describe both, stationary properties and properties of slow dynamical 
processes of a molecular system, in terms of a transition matrix for a jump process on a suitable discretiza- 
tion of continuous conformation space. Here, we introduce statistical estimation methods that allow a priori 
knowledge of equilibrium probabilities to be incorporated into the estimation of dynamical observables. Both, 
maximum likelihood methods and an improved Monte Carlo sampling method for reversible transition ma- 
trices with fixed stationary distribution are given. The sampling approach is applied to a toy example as well 
as to simulations of the MR121-GSGS-W peptide, and is demonstrated to converge much more rapidly than 
a previous approach irP. 



I. INTRODUCTION 

Characterization of the conformational dynamics of 
proteins and other biomolecules in thermal equilibrium 
includes the identification of their metastable states, and 
quantification of their populations and transition rates. 
Such a characterization is essential to analyze and poten- 
tially manipulate biologically important conformational 
transitions, including folding, ligand binding, and aggre- 
gation. Unfortunately, a direct observation of dynami- 
cal processes with an atomistic resolution is impossible 
because the scale of conformation dynamics lies well be- 
low the diffraction limit of optical methods. Spectro- 
scopic methods that provide information in atomistic de- 
tail such as X-ray crystallography do usually only pro- 
vide information about static quantities. NMR spec- 
troscopy methods provide only indirect observations of 
dynamical processes via relaxation dispersion correla- 
tions whose interpretation is challenging and do not pro- 
vide direct structural information. Single-molecule spec- 
troscopic methods can probe the dynamical fluctuations 
of one to two observables directly, but they do not reveal 
molecular structures. 

The recent increase in computing power has enabled 
the study of conformation dynamics in atomistic detail 
via direct molecular dynamics simulations^ ^. Nonethe- 
less, the metastable nature of conformation dynamics^!!] 
in combination with the necessary explicit treatment of 
fast degrees of freedom in the numerical integration of 
the equations of motions renders the spontaneous obser- 
vation of rare events on the milliseconds timescales or 
slower difficult. As a result, one faces severe difficulties 
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when trying to converge expectation values of observ- 
ables depending on slow processes, such as the implied 
time scales of large scale conformational changes^. 

The recent years have seen the development of a host 
of biased or enhanced sampling methods to accelerate 
rare events, and thus to permit the efficient exploration 
of the system's relevant conformations and estimation 
of at least its thermodynamic quantities, such as the 
stationary probabilities of states and stationary expec- 
tation values. To name only some of the best-known 
examples, replica exchange or parallel tempering meth- 
ods facilitate the hopping over energetic barriers by ex- 
changing molecular con forma tions between simulations 
at different temperature d 14 * 1 ^ . Flooding methods obtain 
stationary probabilities by filling up the free energy land- 
scape acc ordin g to the frequency of visits by the evolving 
trajector y 16 ! 17 ! . Umbrella sampling^ proceeds by choos- 
ing an appropriate re-weighting function restricting the 
chain to a subspace relevant to the estimation of a cho- 
sen observable. An improved version using the weighted 
histogram analysis method^ guides the simulation along 
a multidimensional hyper-surface specified by a set of a 
priori chosen reaction coordinates^. For a short ped- 
agogical overview of enhanced ensemble methods see^R 
Applications include replica exchange folding studies of a 
Small RNA hairpiiPl, single-copy tempering for trpzip2, 
trp-cage, and the villin headpiece^, as well as recon- 
naissance meta-dynamics for the binding of benzamidine 
to trypsin^!. Examples for problems that have also been 
successfully treated are first and second order phase tran- 
sitions in lattice spin systems^. 

While biased or enhanced sampling methods can 
generate estimates of equilibrium quantities efficiently, 
they usually do not preserve the equilibrium dynam- 
ics. Thus, dynamical observables such as rates or time- 
correlation functions have to be estimated using other 
methods, chiefly from direct equilibrium molecular dy- 
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namics simulations. An approach frequently used to in- 
tegrate and an alyze molecular dynamics data is Markov 
modelin gj 12 l 26 tt S2l. Markov models approximate the con- 
tinuous phase space dynamics in terms of a discrete space 
Markov jump process. A particular advantage of this ap- 
proach is that Markov processes have been extensively 
studied in Mathematics so that there are a large num- 
ber of rigorous results available. The construction of 
Markov models proceeds through first choosing a suit- 
able discretization of conformation space and then es- 
timating a transition probability matrix from counted 
transitions between conformational subsets specified by 
the discretizatiorPfil. Choosing the discretization so as 
to achieve a n acc urate Markov model is a topic of cur- 
rent researcrPSElHlIl As shown irPH the approximation 
error can be bounded and vanishes as the discretiza- 
tion gets finer and the lag time is increased. A re- 
cently outlined variational method^ can be employed to 
approximate relevant spectral properties of the transi- 
tion operator by an application of the famous Rayleigh- 
Ritz principle. An approach using basis functions and 
variational inequalities makes it possible to connect to 
established methods from electronic structure calcula- 
tions and may proof useful in iteratively improving con- 
formation space discretization. For an overview of the 
Markov state model approach to conformation dynamics 
see"*'. The Markov state model approach has been able 
to reconstr uct complex m olecular processes such as pro- 
tein foldingpP^ 12 ^ 32 1 37 ^, natively unstructured protein 
dynamics^, and protein-ligand binding 39 " 43 from com- 
puter generated trajectories. In addition the Markov 
model framework allows the comparison of simulation 
driven predictions with experimental findings in a con- 
sistent manneJ^HIZl 

Since enhanced and biased sampling methods can sig- 
nificantly improve the convergence of stationary quanti- 
ties in the presence of long timescales, while direct molec- 
ular dynamics simulations can probe dynamical quanti- 
ties depending on short timescales, it would be desirable 
to combine the advantages of both approaches. A natu- 
ral mathematical basis to foster this combination is de- 
tailed balance of the dynamics. Detailed balance states 
that under equilibrium conditions, the ratio of stationary 
probabilities between two states is equal to the inverse ra- 
tio of transition rates or probabilities between them. On 
the microscopic scale, detailed balance is a natural conse- 
quence of the time inversion invariance of the microscopic 
equations of motion and the Gaussian white noise nature 
of the stochastic fluctuations^ (p. 88fL). When using 
Markov models, microscopic detailed balanced directly 
translates into detailed balance on the level of Markov 
states. Therefore, it would be desirable to include prior 
information of the stationary distribution into the esti- 
mation of dynamical observables such as correlation func- 
tions and time scales or rates of conformational changes. 
One could for example use well converged equilibrium 
probabilities estimated on conformational subsets consti- 
tuting a suitable discretization from an extended ensem- 



ble simulation and generate observations of equilibrium 
fluctuations from a standard equilibrium simulation. The 
precise knowledge of the stationary probabilities could for 
example be used to obtain sharper estimates of dynam- 
ical quantities such as timescales for large scale confor- 
mational transitions. 

Detailed balance is now commonly used as a con- 
straint to guide the maximum likelihood estimation of 
Markov m odel t ransition matrices from observed transi- 
tion counts^ESl However, these existing approaches do 
not permit to explicitly include prior knowledge of the 
stationary distribution. Beyond maximum likelihood es- 
timates, the estimation of statistical uncertainty stem- 
ming from the fact that only finitely many transition 
counts have been observed, is crucial to allow a mean- 
ingful comparison with expectation values obtained from 
other simulations as well as with observations from ex- 
periments to be made^l. Furthermore, quantification 
of statistical uncertainties is a prerequisite to guide an 
adaptive samplin g approach that aims at reducing them 
efficienth. * 49 * 51 * 52 !. In Singhal et alP^ direct sampling of 
transition matrices was applied to calculate the distri- 
bution of mean first passage times. A computationally 
efficient procedure to estimate the variance together with 
the mean based on a Gaussian approximation of the dis- 
tribution of transition matrices and a first order Tay- 
lor expansion of the target observable was also devel- 
oped. IrP^the method was extended to the estimation of 
eigenvalues and eigenvectors. IrP^, a similar perturbation 
method was used to evaluate the statistical error of com- 
mittor probabilities. IrP a related approach based on 
perturbation theory of spectral subspaces is developed in 
order to achieve a refinement of a grid-free conformation 
space discretization. A full Baycsian approach for esti- 
mating statistical errors including the detailed balance 
constraint was introduced irW In a subsequent study, we 
have extended the formalism by also including statisti- 
cal uncertainties of spectroscopic observables^. Ref!^ 
has used a different approach, an edge reinforced ran- 
dom walk, to sample reversible transition matrices. As 
yet, the Markov chain Monte Carlo approach in RefP is 
the only approach that permits to explicitly include prior 
knowledge of the stationary distribution into the estima- 
tion of the probability distribution of transition matrices. 
However, this sampler has rather poor mixing properties, 
thus requiring many iterations and a high computational 
load before the probability distributions can be estimated 
reliably. 

In the following we will introduce efficient methods to 
include prior knowledge of the stationary distribution 
into reversible transition matrix estimates: (I) Maxi- 
mum likelihood estimation methods are given that either 
solve a constrained convex optimization problem using 
standard optimization libraries, or proceed via an iter- 
ative likelihood maximization algorithm. (2) An effi- 
cient Gibbs method is introduced to sample the condi- 
tional densities of individual transition matrix elements, 
offering improved convergence properties over the pre- 
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vious approach in Ref.^. The estimation and sampling 
methods described here are implemented in the EMMA 
Markov model toolkfPfil. The maximum likelihood es- 
timation for fixed stationary distribution can be per- 
formed using the EMMA command misestimate and 
the Gibbs sampling of reversible transition matrices with 
fixed stationary distribution is available via the command 
mm_transitionMatrixSampling. 



II. PROBABILITY DISTRIBUTIONS FOR TRANSITION 
MATRICES 

If one has at hand only a finite observation X\ , . . . , Xn 
of a Markov jump process there are usually an infinite 
number of transition matrices P that are compatible with 
the given data. In the following we assume that one 
can directly observe transitions between individual micro 
states i € l,...,n. A single entry p t j of a transition 
matrix quantifies the probability to make a transition to 
state j given that you have started in i, 

Pij =F(X k+1 =j\X k =i). 

If the micro state jump process is Markovian the prob- 
ability of observing a certain realization of the process 
X\ , . . . , Xn depends only on the number of transitions 
between pairs of states in X\ , . . . , Xjq ■ Thus the matrix 
of transition counts C completely determines the proba- 
bility of a given observation for a fixed P, 



p{X 1 ,...,X N \P)=p(C\P). 



(1) 



As a result of Markovianity the probability of observing 
transition counts Cjj given a set of transition probabilities 
Pij is given by the multinomial distribution 



p(C\P) 



IK 



(2) 



However, we need the probability of a certain transition 
matrix given an observation of transition counts, p(P\C). 
Bayes' theorem can be used to relate p(C\P) to p(P\C) 



via 



p(P\C)~p(C\P)p(P). 

Using a suitable conjugate prior with prior counts bij, as 
outlined irP^l, we find that this probability is given by a 
product of Dirichlet distributions 



p(p\c) ~n? 



Cij+bi 
ij 



(3) 



Here the following normalization condition for row- 
stochasticity of P is assumed to hold, 



k=l 



1 



1, 



(4) 



The structure of (J3j) makes it possible to generate inde- 
pendent Dirichlet distribute d row's if no additional con- 
straints on P are imposed- 51 * 52 * 57 * 58 ! If one desires to 
restrict the space of all admissible transition matrices to 
those obeying a detailed balance condition 



(5) 



the additional interdependence between rows prohibits to 
generate samples from ([3 ) by direct sampling of individ- 
ual rows. IrP a Metropo is Hastings Monte Carlo chain 
method is developed to generate random transition ma- 
trices from ([3]) under the detailed balance constraint. In 
the following we will only consider the situation in which 
the stationary probabilities have been already computed 
using a different simulation algorithm. Please note that 
fixing 7Ti, . . . , tt„ and requiring detailed balance reduces 
the number of independent variables p^ from n(n — 1) 



to 



n(n— 1) 



This is a 50% reduction in dimension and 



we expect that imposing this extra symmetry will have 
a large effect when comparing quantities estimated with 
and without these constraints. In the following we will 
use the normalization condition Q to determine the di- 
agonal of P from the off-diagonal elements, 



= 1 - ^Pik 

k^i 



1 = 1, 



and the detailed balance condition ([5| in combination 
with the fixed stationary vector to determine the lower 
triangular part of P from the upper triangular one, 



Pji 



Pij 



1 < i < j < n. 



This approach for incorporating a priori knowledge 
about stationary probabilities has a straightforward gen- 
eralization to situations in which the stationary proba- 
bilities are not precisely known. If one has obtained a 
probability model for the stationary probabilities 

P (tt\E) 

from an enhanced sampling method E one can incorpo- 
rate this prior knowledge of it into a probability model 
for P. The probability model for P given the evidence C 
and E is given by 

p(P\C,E) = J dnp(P\C,7r)p(n\E). 

P ~ p(P\C, E) can be sampled by iteratively generating 
samples of it from P (tt\E) and of P from p(P\C, it). 



III. CONDITIONAL PROBABILITIES 

The Gibbs sampling strategy facilitates sampling of 
a joint distribution by generating random variates from 
the conditionals. In the following we will show that for 
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a fixed stationary vector (jti, . . . , 7r n ) all the conditionals 
of p(P\C) have a simple analytical form. Furthermore 
we will outline a method to generate random variates 
efficiently from all conditionals for all possible configura- 
tions of P and 7r. For the sake of brevity of notation we 
will often supress the fixed observation C when stating 
relations for the conditionals. The conditional density 
for pij given {pki}k^i, 3 ,i=£i,j is 

PiPijlPk&j^ij) ~ /'',, />',, l'„ I',,- 
Plugging in the constraints Q, ([5| we get 

piPijlPk^ljtij) ^P^ +Co1 

X ((1 - Pik)-Pij) C " 

x ((!- Z! Pi*)~— PijT" 

explicitly showing the unvariate dependence on pij . Now 
we define 

A u = C 1 - £ ft*). ( fi ) 
M*>i 



Atf = ?(l- Eft*)- (7) 

Using these we can rewrite the conditional density as 

p(Pij\Pk&,j,ljti.j) ~Pij Cji (Aij -Pi 3 f"{K] -Pij) Cjj - 

(8) 

Without loss of generality we assume that < Aij. 
Then we can define 



Ay 

and define the following parameters, 

C 7 i i 



6 = 

C : 



C J3 ' 

A^- 



(10) 

(11) 

(12) 



Please note that a,b,c > and d > 1. After a little 
algebra we get 



p(x|a, b, c, d) - x Q (l - x) b (d - x) c 



(13) 



with < x < 1. This means that if we can generate ran- 
dom variates from p(x\a, b, c, d) efficiently for all admissi- 
ble parameters, we can efficiently sample all conditional 
densities arising during a Gibbs sampling procedure. The 
dependence of the conditionals for on both Cjj and Cji 
clearly reflects the additional symmetry imposed by the 
detailed balance condition. 



A. Log-concave densities 

We can write the density p(x\a, b, c, d) in the following 
way, 



p(x\a, b, c, d) 



_ q(x\a,b,, 



with 



q(x\a, b, c, d) — alog(x) + Mog(l — x) + c\og(d — x). 
The second derivative of q(x\a, b, c, d) is given by 

hi I u a\ a b c 

{xlaAc ' d) = -^-Jl^-Jd^xf- 

It is easy to see that 

q"(x\a,b,c,d) < 

for all < x < 1 and all parameters a, b, c > and 
d > 1. This is a sufficient condition for q(x\a, b, c, d) 
to be a concave function and therefore all conditionals 
p{x\a, b, c, d) fall into the category of log-concave densi- 
ties. There exist efficient approaches for the generation of 
random variates from a log-concave density given explicit 
knowledge of the mode point and the ability to evaluate 
the density p{x) and the first derivative of its logarithm 
q(x) = logp(x). For an overview of methods to sample 
from log-concave densities see^. The crucial feature em- 
ployed by all these methods is that any concave function 
q : ft —> R is bounded from above by all its tangents, so 
that for all xq for which q'(xo) exists, the following holds 

q(x) < q(x ) + q'(x ){x — x ) Vx G il. 

Since the exponential function is a monotone function we 
have 

f(x) — e q ^ < e i( x °)+i' ( x o)( x - x o) 

There exist efficient schemes for the generation of ran- 
dom variates from log concave densities. These schemes 
require knowledge of the location of the global maximum 
or mode point. The global maximum or mode point of 
p(x\a, b, c, d) is attained at x m with 

q'(x m \a, b, c, d) = 

subject to the constraint < x m < 1. We have 

q'(x\a, b, c, d) =a; _1 (l — x)^ 1 (d — x)^ 1 

{a(l — x)(d — x) — bx(d — x) — cx(l — x)} . 

It is obvious that it suffices to find the zeros of 

a(l — x)(d — x) — bx(d — x) — cx(l ~ x). 

This expression is at most quadratic in x for all ad- 
missible parameters. Therefore extremal points of 
q(x\a, b, c, d) are given by 



1 



Xl,2 



2(a + b + c) 



((a + b)d + (a + c) ± y/r) , 
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with 

r =[(a + b)d + (a + c)] 2 - 4(a + b + c)ad. 

Please note that p{x\a, 6, c, d) has zeros at xq — 0, xq = 1 
and xq = d. Recall that d > 1. This means that there 
is one extremal point in [0, 1] and one extremal point in 
[l,d\. Therefore we conclude that x m corresponds to the 
smaller one of the two extremal points, 

= 2(fl + X & - c) ({a + b)d + (a + c) - y/r) ■ (14) 

Please note that the mode point need not lie in the inte- 
rior of the unit interval so that x m = and x m = 1 are 
possible values. 
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B. Optimal piecewise approximation 

We will use a piecewise enveloping function 
g{x\a, b, c, d) bounding p(x\a, b, c, d) consisting of a 
uniform density around the mode point and exponen- 
tial tails elsewhere. For log concave densities f(x) it 
is possible to use the following general approach to 
find an enveloping function g{x) for f(x). Let again 
q(x) = \ogf(x). Consider the following piecewise defined 
function h(x), 

q(xi) + q'(xi)(x - xi) -oo < x < xi 
h(x) = { q(x m ) xi < x < x u . 

q(x u ) + q'(x u ){x - x u ) x u < x < oo 

Here x\ and x u denote the lower and the upper bound 
for a region around x m in which f(x) will be bounded by 
a uniform density f{x m )x[ Xl . Xu \ ( x )- As a consequence of 
concavity the function h(x) is a valid dominating function 
for q{x). Thus g{x) = e h ( x ) is a valid enveloping function 
for f(x). Figure 1 shows p(x) and the enveloping density 
g(x). The optimal choice for x\ and x u is the one that 
minimizes the area between g(x) and f(x) leading to the 
lowest possible rejection rate. One can show^ that an 
%i < %rn and x u > x m is optimal if 



f(x m ) 



/«) 



Here e denotes the Euler number. If f^ 1 is explicitly 
known finding the optimal solution is straightforward. 
Please note that for unimodal (continuous) densities / 
the inverse / _1 is always unique on x < x m and on 

x ^ x m- Unfortunately we do not have an explicit ex- 
pression forp _1 (a;|a, 6, c, d). It is however always possible 
to choose suboptimal points xi and x u at the cost of a 
larger rejection rate. In the case that x m lies on the 
boundary of the domain of f{x) the bounding function 
will be a single exponential function. In the case that 

xi or x u lie outside the domain of definition one restricts 
h(x) by truncating it a the boundary points so that only 
one exponential tail or only the constant part will sur- 
vive. 



Figure 1: The conditional density p(x\a, b, c, d) for 
a = 8.0, b = 2.0, c = 4.0 and d = 30.0 (solid line) and 
the corresponding enveloping function g(x) (dashed 
line). The density was scaled to the mode point value 
p{x m \a, b, c, d) to fit it into the range [0, 1]. 



C. Suboptimal piecewise approximation 

Unfortunately we do not have the inverse to the con- 
ditional density (13). We will use additional information 



about p(x\a, b, c, d)to make a good although suboptimal 
choice for xi and x u . In the following let < x m < 1. 
A second order Taylor expansion of q(x\a, b, c, d) around 
the mode point yields a Gaussian approximation of the 
conditional density, 



, , i / \ I 9 rra) i 
p(x) ps exp < q(x m ) H — (x 



s 



-q"(x m ) ' 



We 



The standard deviation is given by a = 
simply set 

X\ X m <T, x u — x r 



A comparison between the optimal points and the ones 
obtained from the Gaussian approximation to p{x) is 
shown in|Figure 2| 



D. Rejection sampling using the envelope 

It is straightforward to generate random variates from 
the individual pieces of the enveloping function. There 
are fast and reliable implementations for the generation 
of uniform as well as for exponential random variates 
and we can use rejection to sample from pieces of p{x) 
individually. We can decompose p(x) as follows 

p(x) =Pip(x)xio, Xl ](x) +P2P(x)x[x l ,x u ](x) + 
Pzv(x)X[x u ,i\(x) 



G 




Algorithm 1: Sample p(x\a, b, c, d) 



Figure 2: The picture shows the location of the optimal 
points x\ and x* u for p(x) (solid line) and the points xi 

and x u obtained from the Gaussian approximation 
(dashed line). The Gaussian approximation touches the 
density at the mode point. 



with pi denoting the weights of the individual pieces. We 
do not know the discrete probabilities pi a priori but 
there is a simple and efficient algorithm circumventing 
the need for pi altogether. The algorithm does only re- 
quire the weights Wi of the individual pieces of g(x). Due 
to the simple form of g(x) obtaining analytic expressions 
for Wi is straightforward. The following [algorithm 1| is a 
variant of the modified composition method that can be 
found irPl ( p . 69). In the following let Q < xi < x u < 1. 
The treatment of special cases is straightforward but re- 
quires additional branches in the algorithm complicat- 
ing notation. The discrete probabilities of the individual 
dominating pieces Wi are given by 



w, 



with 



dxp( Xl )e^^-^ = 4p\; 

q'(xi) 



(15) 



(16) 



r-2 = I dxp(x m ) = p{x m )(xi - x u ), (17) 

'Xl 



r 3 = dxpix^'^'^^ 



P(X U ) 

-q'(x u )' 



(18) 



IV. MAXIMUM LIKELIHOOD ESTIMATION 

The maximum likelihood estimate T* is the optimal 
point of the likelihood function p(C\P). In other words 



Input: a, b, c, d 
Output: x 

Compute x m using ( 14 1 



a = 



i 



-9"Om) 

Xi = x m ~ a 

X u — X m "t - (J 

Compute Wi, W2, and W3 using (151, (16 1, (17 1, (18 1 
repeat 

Z ~ X[o,i]( x ) 
U ~ X[o,i](x) 
if Z < wi then 
repeat 

y ~ e -i'^i)y 
x = Xi-y 
until x > 
else if Z < wi + W2 then 

V ~ X[a,i]( x ) 

x = xi + (x u — xi)y 

else 

repeat 

y ~ e q ' My 
x — Xi +y 
until x < 1 
end 

until Ug(x) < p(x) 



the given observation C is most likely to be generated 
by the optimal model T* . The multinomial form of 
the likelihood and the linear nature of the constraints 
makes it possible to reformulate the problem of finding 
the maximum likelihood estimate as a convex optimiza- 
tion problem. Thus the global optimum T* can be effi- 
ciently found. For a thorough introduction and exhaus- 
tive overview see^l. 

Please note that log is a strictly monotone function so 
that finding the maximal point of p(C\P) is equivalent to 
finding the maximal point of the log-likelihood function, 

l(C\P) = \og P (C\P). 

Finding the reversible transition matrix with given sta- 
tionary distribution tt maximizing l(C\P) can be stated 
as the following optimization problem. 



minimize 



subject to 



■E 



Cij fog Pi j, 

-Pij < 

n 

J2 pi 



1 



*!=i 

KiPij - KjPji = 



1 < i,j < n, 
1 < i < n, 
1 < i < j < n. 



This is a constrained optimization problem in n 2 vari- 
ables. A reduction in the number of independent vari- 
ables can be achieved by eliminating constraints and ex- 
plicitly incorporating them into the objective function 
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and the remaining constraints. There exist a number of 
numerical libraries for the solution of convex optimiza- 
tion problems. We have used the freely available python 
cvxopt module^. The numerical solution of a convex 
optimization problem is usually facilitated by iteratively 
updating the suboptimal point by solving a system of 
linear equations containing the hrst and second order 
derivatives of the objective function and all non-linear 
constraints as well as the matrices specifying the linear 
constraints. In order to start the iterative scheme one 
needs a valid initial point to start the iteration. In the 
following we will outline how we can compute a reversible 
transition matrix P with fixed stationary distribution 
7r from any given possibly non-reversible transition ma- 
trix Q. Our method is similar to an approach outlined 
irP2. The guiding idea is the mechanism underlying the 
Metropolis-Hastings algorithm transforming an arbitrary 
transition matrix into one that is reversible with respect 
to a given stationary distribution. Denote by ay the fol- 
lowing weights, 



au = mm{l, - J -^ L -}. 



(19) 



These are precisely the weights in the Metropolis- 
Hastings algorithm. Define a new transition matrix PW 
by 



3 (o) = I <Hjqij i + j 



(20) 



Please note that the diagonal elements pa will always 
be greater after such a transformation pa > qa for all 
i = l,...,n. We will use the transformation outlined 
above in order to generate a valid starting point for the 
likelihood maximization scheme. Since Q is arbitrary we 
choose it to be the non-reversible maximum likelihood 
estimator, 



En 
k= 



We generateP^ ) by enforcing the reversibility condition 
with respect to tt using (|l9|,(20). 

As an alternative, the likelihood maximization can 
be performed by iteratively maximizing the conditional 
probabilities of Pij for i < j. This is either done by the 
reversible transition matrix estimator described iii^SJ or 
by the following iterative algorithm, algorith m 2| The 
former algorithm is implemented in by the 

mm_estimateFixedPi command. 



V. A GIBBS SAMPLER FOR TRANSITION MATRICES 
WITH FIXED STATIONARY DISTRIBUTION 

The Gibbs sampling approach as first presented irP^l 
achieves the following. Let p{x\, . . . , x n ) be a given 
joint probability distribution and denote by Pi{xi) the 



Algorithm 2: Iterative maximum likelihood 
estimation with fixed stationary distribution 

Input: 7r, C 
Output: P* 

P = P (0) (7T,C) 
S = P 

while 8 > e do 

for i € {1, . . . , n } do 
for j £ {1, . . . , n} do 
if i < j then 

Compute parameters Ay, Ay, a, b, c, d. 
Mode point x m = x m (a, b, c, d) 
Pij = x m ■ min(Ay, Ay) 



pa = A, 

Pji = ^:Pij 

end 
end 
end 

5= \l(C\P) - 1{C\S)\ 

S = P 
end 
P* = P 



Pi 3 



Pji 



marginal distribution of the i-th variable, Pi(xi) = 
p{xi\x 3= £i). It can be shown that under certain condi- 
tions (Lemma 10.11 irP2|) the ability to generate random 
variates from all conditionals Pi{xi) is sufficient to gen- 
erate samples from the joint distribution p(x\, . . . ,x n ). 
The algorithm can be stated as follows. Denote by 
(x^\ . . . , Xn" 1 ) the k-th random vector generated by the 
algorithm. Then a new sample is generated by "sweep- 
ing" through the vector updating all coordinates from 
the respective conditional densities. In other words, for 
i in 1 n, 



„(*+!) 



' p(xi\x\ 



(fc+1) 



r (fe+l) T ( fe ) 



. ,X. 



There exist several variants of the Gibbs sampling al- 
gorithm, the "random scan" version which picks i from 
{1, . . . , n} at random and returns a new sample after n of 
such updates instead of sweeping through all coordinates 
in succession is especially popular. 

Recall that the conditional distribution of p^ is given 
by ^ with parameters Ay and Ay explicitly given by 
^ and Q. Having obtained an explicit expression for 
p[Pij\pk^i,j,i^i,j) for all i < j we can proceed to con- 
struct a Gibbs sampling algorithm to generate random 
variates P from the joint distribution. In order to start 
the Markov chain we need a valid initial transition ma- 
trix p(°) obeying detailed balance with respect to the 
given stationary distribution (7Tj)i<j< n . The matrix P'°) 
should also be irreducible so that choosing any irreducible 
transition matrix qij and enforcing detailed balance with 
respect to 7r according to ( 20 ) will result in a valid ini- 



tial transition matrix possessing the desired properties. 
However, we recommend to use the maximum likelihood 
estimate obtained above as a starting point for the Gibbs 
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chain to immediately draw transition matrices from re- 
gions of high probabilities. The computation of parame- 
ters during the Gibbs sampling procedure can be simpli- 
fied by noting that 



A 



(fe) 



P 



(fc-i) (fc-i) 



and 



a£> = ^ f. 

7T,; V 



where fc denotes the step in the Gibbs sampling chain. In 
other words coupling between elements is mediated only 
by diagonal elements pa. 



This gives rise to algorithm 3 



Algorithm 3: Gibbs sampling of P with fixed 
stationary distribution 7r 

Input: 7T, C, P^-^ 
Output: P (fc) 
repeat 

i ~ {l,...,n} 

j ~ {l,...,n} 
until i < j 



A (ft) _ (fc-1) 

^(fe) _ TTJ_ 



if A 



(fc-i) 



j / (fc-i) . (fc-i) 



„ <A^then 

A< fc -!) 



A 

c = C 77 



(fc-1) 



traits of metastability. As discussed irf^l metastable 
Markov processes on discrete state spaces can be under- 
stood in terms of nearly uncoupled Markov chains with 
small transition probabilities between blocks defining the 
dynamics within a single metastable subset. For finite 
observations of the metastable process the small proba- 
bilities for transitions between metastable sets and the 
zero probabilities of forbidden transitions might become 
indistinguishable in an ensemble of transition matrix gen- 
erated by a sampling approach with no prior informa- 
tion. If the uncertainties of small but non-zero transition 
probabilities are of the same order than those that corre- 
spond to forbidden transitions it might not be possible to 
recover the desired metastable properties from the gen- 
erated ensemble. We will show how one can enforce the 
generation of an ensemble of transition matrices compat- 
ible with the given observations. 

If one still assumes detailed balance with respect to 
7r observed transitions Cij as well as observed transitions 
Cji indicate nonzero probabilities pij, i < j. Therefore we 



conclude that whenever 



> the probability of 



else 



Pij > is positive, P(p^ > 0) > 0. To enforce sampling 
of metastable transition matrices we require that p^ = 
if c-ij + Cji — 0, i < j, for all P in the sample. In 
other words the sparsity structure of C + C T is enforced 
for all P. This sparsity prior is equivalent to using a 
prior count of -1 on all with cy + Cji = in ^ 

(see supplementary information of 2 ). The sparse prior 
is applied by restricting the sampling algorithm to those 
elements p^ for which + Cji > 0. 



VI. RESULTS 



b = Cjj 
c = C u 
end 

Sample x 



(fc-i) 



x a (l — x) (d — x) c using algorithm 1 



Pij' = x ■ min (A^ 

Ah-l) 



(k) 

V 
(fc) 

M 

,(*0 



(fc-1) A (fc-1) 



„(fc) 



(fc-i) 



„(fc) 



The usual procedure returns only after I such ele- 
mentary steps have been taken resulting in a sequence 
p(°) j pCOj ... j p( Nl ) of transition matrices. Here I de- 
notes the number of independent variables, I — n(n — 
l)/2. 



A. Enforcing sparsity - a prior for metastable dynamics 

The equilibrium dynamics of proteins does often ex- 
hibit the feature of metastability. Thus any transition 
matrix characterizing an approximation via a Markov 
jump process on conformation space should also exhibit 



In the following we will show that the above outlined 
Gibbs sampling converges much faster than the Metropo- 
lis Hastings approach developed ioM Please recall that 
the general Metropolis Hastings approach generates ran- 
dom variates from the density p(x) by choosing proposals 
y conditioned on the current state of the chain x from a 
proposal density q(x, y) and accepts proposed samples 
with the following acceptance probability, 



a(x, y) = min{l, 



p(y)q{y,x) 

p{x)q(x,y) 



}■ 



The crucial difference between Gibbs sampling and 
Metropolis Hastings sampling is that the Metropolis 
chain remains in the current state as long as the proposed 
value is rejected while the Gibbs sampling approach gen- 
erates a new sample at each step. This possibility to 
remain in the current state usually leads to longer corre- 
lation times for the Metropolis chain than for the Gibbs 
chain. Thus one needs to run longer Metropolis chains 
than Gibbs chains to achieve an equal degree of con- 
vergence. On the other hand one needs to be able to 
generate random variates from all conditionals efficiently 
while the Metropolis chain can be advanced using a pos- 
sibly very simple proposal density q(x,y). In the fol- 
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lowing we will compare our current sampling approach 
with the one developed iiP and demonstrate improved 
convergence and more-rapidly decaying autocorrelation. 
We start with a simple model using the following count 
matrix 



C 



100 5 
20 4 20 
8 75 



and stationary distribution 

7T = (0.5,0.01,0.49) . 



(21) 



(22) 



to assess the convergence properties of the two ap- 
proaches. 



A. Conditional distributions 



We have generated a sample of N = 10 6 random vari- 



ates from the conditional density, p(x\a, 6, c, d) (131, for 
various choices of parameters a, b, c, d to demonstrate 
the ability of our new method to correctly generate ran- 
dom variates from all possible conditional densities. In 
|Figurc 3] we compare the shape of each histogram to the 
graph of the exact density function for the same param- 
eter values. The figures clearly indicate that all densities 
have been correctly sampled. 



B. Convergence of mean values and variances 

In order to assess the quality of a Monte Carlo sam- 
pling procedure one usually computes the standard error 
of the mean of an observable O estimated from a finite 
sample generated by evolving the chain for a finite num- 
ber of steps. As observable we choose the value of in- 
dividual matrix elements, O — pij and the value of the 
second largest implied time scale, O — t 2 - We have gen- 
erated a maximum likelihood reversible transition matrix 
of the count matrix ([2l|) with stationary distribution ( 22 ) 
using the algorithm in™ Then n en semble 



100 indepen- 
dent Gibbs samplers using [aTgorithm "3 were used, taking 
N steps in the range 10 2 . . . 10 b , estimating E(pij) and 
E(t 2 ) for each (n ensem bi e , N). Then we have estimated 
the standard deviation over the sample for each (fixed) 
N. See Figure 4a for a comparison of the convergence 



of E(pij) between the two sampling approaches. The 



slowest relaxation timescale t 2 , Figur q4b} is an exam- 
ple for a global observable with a functional dependence 
on all elements p^ so that the expectation value £(£2) is 
a suitable measure to access the convergence of general 
observables. A comparison of the convergence behavior 
is shown in Figure fTb] We can also choose to observe the 
variance of individual matrix elements as well as the vari- 
ance of the second largest implied timescale. The setup 
is identical to the one outlined for mean values. The fig- 
ure also shows the convergence of the variance Y(x) for 



a single matrix element, Figure 4c as well as for the im- 
plied timescale, Figureptd] The figures clearly indicate 



the improved convergence properties of the presented ap- 
proach over the previous MCMC sampler irP} requiring 
two orders of magnitude less sampling steps to achieve a 
similar error level. 



C. Autocorrelation functions 

As another measure of the improved convergence prop- 
erties we compare the mixing times of the MCMC chain 
and the Gibbs chain. We have generated a sample of 
10 transition matrices for the above count matrix, (21 1, 



and stationary distribution, ( 22 ) . Each sample was gen- 



erated by advancing the chain using a single Gibbs or 
Metropolis step. Let X k be the value of the observable 
for the k-th sample. We have estimated the normalized 
autocorrelation function 



Px{n) 



E[(X k -n)(X k+n -n)] 



using the following estimator for the sample autocorrela- 
tion 



N-l-l 



Px(n) 



a 2 (N -I) 



(x fc -/i)(x fcH 



fc=0 



with < n < I. The autocorrelation function for P13 
in Figure f"5a| clearly indicates the faster decay of auto- 
correlations for the Gibbs sampler. The autocorrelation 
function for the second largest implied time scale demon- 
strates a significant improvement over the previous ap- 
proach, see I 5b| The number of steps required in order 
to generate decorrelated samples, ridecorr, has been esti- 
mated by assuming an exponential decay for the auto- 
correlation function, 



p{n) 



The area under the graph of the autocorrelation function 
was computed using the trapezoidal rule and used as an 
estimate for n c i ecorr . Values for ridecorr as well as for the 
corresponding decorrelation time, tdecorr, for observables 
P13 and t 2 can be found in Table I ridecorr is two orders 



of magnitude smaller for the Gibbs sampler than for the 
Metropolis approach. Due to the comparable speed of 
elementary sampling steps for both algorithms the im- 
proved decorrelation constant, ridecorr, leads to a similar 
improvement in decorrelation time. 



D. Application to simulation data 

In order to demonstrate the performance of the transi- 
tion matrix sampling method we have applied the pre- 
sented transition matrix Gibbs sampling algorithm to 
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(g) 



(10 



(i) 



Figure 3: Conditional density p{x) (solid line) for different parameters a, b, c, and d. The histograms show a sample 
of N = 10 6 random variates generated using the method outlined above. First row c = 4, d = 10: (a) a = 5, b = 0, 
(b) a = 5, b = 2, (c) a = 0, b = 2. Second row c = 40, d = 100: (d) a = 100, b = 5, (e) a = 100, 6 = 100, (f) a = 5, 
6 = 100. Third row: (g) a = 0, b = 0, c = 4, d = 10, (h) a = 0.5, 6 = 0.2, c = 40, d = 100, (i) a = 0, b = 3 • 10 4 , 

c = 4- 10 3 , d= 10 4 . 



simulation data for the synthetic peptide MR121-GSGS- 
W. Trajectories were obtained by standard equilibrium 
dynamics simulations of a constant volume ensemble at 
293-fT in explicit water with the Berendson thermostat us- 
ing the Gromac9®l simulation software. Each of the two 
trajectories used has a total length of 4/xs with trajec- 
tory frames separated by a time step of lOps. A detailed 
description of the simulation setup can be found in the 
supplementary information of 46 . The trajectories were 
clustered using regular spatial clustering of RMSD dis- 
tances using EMMA^SJ. A spatial cutoff of 3.5nm resulted 
in a clustering with 107 distinct micro states. In order 
to obtain an estimate for the stationary probabilities of 
each micro state a Markov model with a lag time of 10ns 



was generated using the reversible transition matrix es- 
timator presented irP^l. The stationary distribution was 
obtained from the estimated transition matrix as the left 
eigenvector with eigenvalue 1. A corresponding matrix 
containing transition counts between individual micro 
states was obtained by counting transitions at the same 
lag time. The sparse prior for metastable dynamics pre- 
sented above was used to generate an ensemble of transi- 
tion matrices using both, the Metropolis, and the Gibbs 
sampling procedure. We have started n en sembic = 
independent chains with N steps in the range 10 5 . . . 10 7 
and estimated E(i 2 ) and ¥(£2) for each (n e nsemble> N). 
The standard deviation was estimated over the sample for 
each (fixed) N. In order to speed up the computation we 
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Figure 4: Results obtained for the model system with count matrix (21 1 and stationary distribution (22). Standard 



deviation for estimated mean and variance of observables is plotted against the number of elementary sampling 
steps N. (a) mean transition matrix element E(p 13 ), (b) mean of the second largest implied time scale E(i 2 ), (c) 
transition matrix element variance V(pi3), (d) variance of the second largest implied time scale V(t2). The Gibbs 
sampler introduced here (solid line) convergences faster than the Metropolis chain frorrfll (dashed line) by almost two 
orders of magnitude. For the mean second largest time scale £(£2), (b), the achieved speedup is more than one order 

of magnitude. 



have estimated from a spectral decomposition only af- 
ter I elementary sampling steps. We have chosen I as the 
number of non zero independent transition probabilities 
Pij . Figure 6] clearly indicates the improved convergence 
properties of the presented approach. Here, the Gibbs 
procedure needs one order of magnitude less sampling 
steps to reach a similar error level. A comparison of the 



autocorrelation functions for ti , Figure 7 shows an order 
of magnitude smaller decorrelation constant n^ ecorr for 
the Gibbs sampler compared to the Metropolis sampler 
. Table I shows ndecorr with the corresponding decor- 
relation time tdecorr- Due to the comparable speed of 
elementary sampling steps for both algorithms the im- 
proved decorrelation constant, ndecorr, leads to a one or- 



der of magnitude lower decorrelation time, tdecorr, for the 
Gibbs sampling algorithm. 



E. Computational efficiency 



For large entries in the count matrix c,-j 3> 1 the af- 
fected conditionals will be sharply peaked so that the 
uniform proposal densities used irP will have very low 
acceptance rates. This results in slow mixing chains 
and large autocorrelation times for the Metropolis al- 
gorithm so that much longer chains have to be run in 
order to achieve the same level of convergence. Due to 



the fact that algorithm 3 only use standard distributions 
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Figure 5: Autocorrelation functions for the model system with count matrix (21) and stationary distribution (22). 



(a) autocorrelation function for the transition matrix element P13, (b) autocorrelation function for the second largest 
implied time scale t 2 ■ The number of steps to take until samples are decorrelated ndecorr is two orders of magnitude 
smaller for the Gibbs sampling method (solid line), ndecorr = 3 for p 13 as well as for t 2 , compared to the Metropolis 
sampling method (dashed line), ndecorr — 123 for pi% and ndecorr — 135 for t 2 . 



to envelope the conditionals, generating a random vari- 
ate by rejection sampling from the conditional density is 
efficient enough to allow the generation of long chains. 
The Gibbs sampling algorithm, [algorithm 3| was imple- 
mented using the colt library. The algorithm performs 
10 s elementary sampling steps in 89.1s on a 2GHz Intel 
processor. Performing the same number of elementary 
Metropolis steps takes 83.4s, with a 27% overall accep- 
tance rate. The acceptance rate for the Metropolis step is 
also highly dependent on the specific element. The num- 
ber of steps required in order to generate decorrelated 
samples ndecorr &s well as the wall-clock decorrelation 
time tdecorr is shown in |Tablc"T| as an indicator of compu- 
tational efficiency. The decorrelation time is calculated 



as ndecorr ' ^sample 



with t 



sample 



0.9us for the Gibbs 



sampling algorithm t samp i e = 0.8/is for the Metropolis 
algorithm. 



Gibbs sampler 

tdecor 



Metropolis sampler 

ndecorr tdecorr 



P13 



t-2 



3x3 count matrix from (21 1 

3 2.7/J.s 123 

3 2.7ns 135 

MR121-GSGS-W peptide 
4600 4.14ms 33000 



98.4/is 
108^s 



26.4ms 



Table I: Decorrelation times for estimated mean and 
variances of observables. The results were generated for 
the model system with count matrix (21) and 
stationary distribution (22) (p 13 and t 2 ) 



and for the 
MR121-GSGS-Wpeptide (t 2 only). 



VII. COMPARISON WITH NONREVERSIBLE 
ESTIMATION 

We have used the following 3x3 transition matrix, 



T 



0.99 0.01 0.0 
0.45 0.1 0.45 
0.0 0.01 0.09 



to generate a transition counts C by evolving a Markov 
chain for TV = 5000 steps starting from micro state i = 0. 
To simulate the effect of having used an efficient enhanced 
sampling algorithm, the exact stationary distribution 

fi = (0.4945,0.011,0.4945) 



and the observed transition counts were used to gener- 
ate a sample of 100000 random transition matrices using 
the Gibbs sampling algorithm. For each of the randomly 
generated transition matrix the second largest eigenvalue 
A2 and the corresponding implied time scale t 2 was com- 
puted. For comparison, the observed transition counts 
were used in a similar manner to generate a sample of im- 
plied time scales without prior knowledge of stationary 
distribution and without explicitly enforcing a detailed 
balance condition. It is clearly visible in Figure 8 that 



the estimation procedure including knowledge about sta- 
tionary probabilities gives a more accurate and a much 
sharper estimate of this quantity. 
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Figure 6: Results obtained for the synthetic peptide 
MR121-GSGS-W. Standard deviation of [lithe mean 
implied time scale E^), 6b the implied time scale 



variance V(^2)- The Gibbs sampler (solid line) shows a 
faster convergence than the Metropolis sampler (dashed 
line) for mean and variance of the second largest 
implied timescale £2- 



VIII. DISCUSSION AND CONCLUSION 

We have presented a Gibbs sampling algorithm for 
the generation of transition matrices fulfilling the de- 
tailed balance constraint with respect to a given station- 
ary distribution. The presented algorithm shows a clear 
improvement in convergence speed and autocorrelation 
times over the algorithm presented irW We believe that 
the presented algorithm will be a useful tool for Monte 
Carlo sampling of transition probabilities when a pri- 
ori knowledge about stationary probabilities is available 
in addition to observed transition counts. As already 
pointed out ija enforcing the detailed balance condition 
can lead to an immense reduction in variance of cer- 
tain off-diagonal elements leading to sharper estimates 
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Figure 7: Autocorrelation function for the 
MR121-GSGS-W peptide count matrix. The number of 
steps to take until samples are decorrelated ridecorr is an 
order of magnitude smaller for the Gibbs sampling 
method (solid line), ridecorr — 4600, compared to the 
Metropolis sampling method (dashed line), 

ridecorr = 33000. 
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Figure 8: Histograms for implied time scale £2 
corresponding to second largest eigenvalue A2. The 
histogram of timescales generated by incorporating 
knowledge about stationary probabilities (left 
sub-figure) in comparison to the histogram of timescales 
generated by the standard non-reversible method. It is 
clearly visible that the sample mean (dashed line) gives 
a more accurate prediction of the true value (solid line) 
due to the additional information about stationary 
probabilities. 



for kinetically relevant observables. With a priori esti- 
mates of stationary distributions available from extended 
ensemble simulations and an increased interest in esti- 
mating dynamical quantities of molecular systems from 
Markov model based approaches the outlined algorithm 
will hopefully become a useful statistical tool for the anal- 
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ysis of metastable systems. 

There are several directions for future research. An 
improved scheme for the generation of random variates 
from the conditional density could further speed up the 
algorithm allowing the generation of transition matrices 
for processes with larger state spaces. Larger acceptance 
rates could be already achieved by finding better approx- 
imations to the optimal boundary points xf, x* u in the 
definition of the piecewise enveloping function h(x). In 
fact the only parameter of the conditional density that 
is updated after a new sample is drawn is d. All pos- 
sible values for a, b, c could in principle be calculated a 
priori so that one might find a set of n[n — l)/2 opti- 
mal proposal densities each parametrized by d. Finding 
a transformation removing the parametric dependence of 
the conditionals on d altogether seems unlikely but would 
of course open up the possibility for the design of even 
faster algorithms. 

We are currently pursuing an application of the algo- 
rithm to data sets obtained by standard molecular dy- 
namics simulations together with estimates of the equi- 
librium distribution obtained from enhanced sampling 
algorithms such as meta-dynamics, generalized ensem- 
ble simulations or umbrella sampling to obtain sharper 
estimates of dynamical quantities. 
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